Index Processing

Processing raw index entries raises several issues. Some high-level issues are described below with references to the example given in Figure 2; how these tasks can be realized is detailed in the next section.

Permutation. Index entries are sorted alphabetically (Figure 2.c). The index processor must differentiate among different types of keys such as strings, numbers, and special symbols. Upper and lower case letters should be distinguished. Furthermore, it may be necessary to handle roman, arabic, and alphabetic page numbers.
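The ordering rules above can be sketched as follows. This is only an illustration under stated assumptions: the relative order of the key classes (special symbols, then numbers, then strings), the tie-breaking between upper and lower case, and the names key_class, entry_sort_key, and classify_page are hypothetical and not taken from the system itself.

```python
# Values of the roman digits, used to recognize and evaluate roman page numbers.
ROMAN = {"i": 1, "v": 5, "x": 10, "l": 50, "c": 100, "d": 500, "m": 1000}


def key_class(key):
    """Classify a key: 0 = special symbol, 1 = number, 2 = string (assumed order)."""
    if key.isdecimal():
        return 1
    if key[:1].isalpha():
        return 2
    return 0


def entry_sort_key(key):
    """Build a tuple that sorts by class first, then by value within the class."""
    cls = key_class(key)
    if cls == 1:
        # Numbers compare by numeric value, so "9" sorts before "10".
        return (cls, int(key), "", "")
    # Compare case-insensitively first, then by the exact string, so that
    # "Alpha" and "alpha" stay distinct but end up next to each other.
    return (cls, 0, key.lower(), key)


def classify_page(page):
    """Return (kind, value) for an arabic, roman, or alphabetic page number."""
    text = page.lower()
    if text.isdecimal():
        return ("arabic", int(text))
    if text and all(ch in ROMAN for ch in text):
        # Assumption: letters such as "c" or "i" are read as roman numerals.
        values = [ROMAN[ch] for ch in text]
        total = 0
        for i, value in enumerate(values):
            smaller_than_next = i + 1 < len(values) and value < values[i + 1]
            total += -value if smaller_than_next else value
        return ("roman", total)
    if len(text) == 1 and text.isalpha():
        # Assumption: alphabetic page numbers are single ASCII letters.
        return ("alphabetic", ord(text) - ord("a") + 1)
    raise ValueError("unrecognized page number: " + page)


ordered = sorted(["beta", "10", "Alpha", "@home", "9", "alpha"], key=entry_sort_key)
```

Here ordered places the symbol key first, the numbers in numeric order next, and the strings last. A real processor would also need a policy for letters that are ambiguous between roman and alphabetic numbering.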

Merging. Different page numbers corresponding to the same index key are merged into one list. Also, three or more successive page numbers are abbreviated as a range (as in the case of alpha, iv, 1-3, Figure 2.c). If citations on successive pages are logically distinct, good indexing practice suggests that they should not be represented by a range. Our system allows user control of this practice.
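A minimal sketch of the merging step, restricted to arabic page numbers. The function name merge_pages and the parameters min_run and use_ranges are hypothetical; use_ranges stands in for whatever form of user control the system provides.

```python
def merge_pages(pages, min_run=3, use_ranges=True):
    """Merge integer page numbers into a sorted list of singles and ranges."""
    unique = sorted(set(pages))  # duplicates of the same page collapse to one
    parts = []
    start = 0
    while start < len(unique):
        # Extend the run while the next page is the direct successor.
        end = start
        while end + 1 < len(unique) and unique[end + 1] == unique[end] + 1:
            end += 1
        run = unique[start:end + 1]
        if use_ranges and len(run) >= min_run:
            # Three or more successive pages are abbreviated as a range.
            parts.append(f"{run[0]}-{run[-1]}")
        else:
            parts.extend(str(page) for page in run)
        start = end + 1
    return ", ".join(parts)
```

For example, merge_pages([3, 1, 2, 2, 7, 8, 10]) yields "1-3, 7, 8, 10", and passing use_ranges=False keeps logically distinct citations on successive pages as separate numbers.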

Subindexing. Multi-level indexing is supported. Here, entries sharing a common prefix are grouped together under the same prefix key. The special symbol `!' serves as the level operator in the example (Figure 2.a and 2.b). Primary indexes are converted to first level items (the \item entries in Figure 2.c) while subindexes are converted to lower level items (e.g., \subitem or \subsubitem entries in Figure 2.c).
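The grouping of entries under a shared prefix might be sketched as below. The sketch assumes that entries are already sorted and merged, that at most three levels are used, and that each output line has the form "command key, pages"; the exact layout of Figure 2.c may differ, and the name format_entries is hypothetical.

```python
LEVEL_COMMANDS = ["\\item", "\\subitem", "\\subsubitem"]


def format_entries(entries, level_op="!"):
    """Turn sorted (key, pages) pairs into \\item/\\subitem/\\subsubitem lines."""
    lines = []
    previous = []
    for key, pages in entries:
        levels = key.split(level_op)
        if len(levels) > len(LEVEL_COMMANDS):
            raise ValueError("too many levels in key: " + key)
        # Count how many leading levels are shared with the previous entry;
        # those prefix keys have already been printed.
        shared = 0
        while (shared < min(len(levels), len(previous))
               and levels[shared] == previous[shared]):
            shared += 1
        for depth in range(shared, len(levels)):
            line = LEVEL_COMMANDS[depth] + " " + levels[depth]
            if depth == len(levels) - 1:
                # Only the innermost level carries the page list.
                line += ", " + pages
            lines.append(line)
        previous = levels
    return lines
```

With the input [("alpha", "1"), ("alpha!beta", "2"), ("gamma!delta", "5")], the prefix alpha is printed once and beta appears beneath it as a \subitem, while gamma is emitted as a bare \item heading for its subentry.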

Actual Field. The distinction between a sort key and its actual field is made explicit. Sort keys are used in comparison while their actual counterparts are what end up being placed in the printed index. In the example, the `@' sign is used as the actual field operator: the string preceding it is the sort key and the string following it is the actual key (e.g., the \index{alpha@{\it alpha\/}} in Figure 2.a). The same sort key with and without an actual field is treated as two separate entries (cf. {\it alpha\/} and alpha in the example). If a key contains no actual operator, it is used as both the sort field and the actual field.

The separation of a sort key from its actual field makes entry sorting much easier. If there were only one field, the comparison routine would have to ignore syntactic sugar related to output appearance and compare only the ``real'' keywords. For instance, in {\it alpha\/}, the program has to ignore the font setting command \it, the italic correction command \/, and the scope delimiters {}. In general, it is impossible to know all the patterns that the index processor should ignore, but with the separation of the fields, the sort key is used as a verbatim string in comparison; any special effect can be achieved via the actual field.
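The benefit of the separation can be seen in a short sketch: the comparison never inspects formatting commands, because it only looks at the verbatim sort key. The function name split_actual is hypothetical, and breaking ties between equal sort keys by the actual field is an assumption made here to keep the order deterministic.

```python
def split_actual(field, actual_op="@"):
    """Split a raw field into (sort_key, actual_key) at the first actual operator."""
    sort_key, sep, actual = field.partition(actual_op)
    if not sep:
        # No actual operator: the key serves as both sort and actual field.
        return (field, field)
    return (sort_key, actual)


raw = ["beta", "alpha@{\\it alpha\\/}", "alpha"]
# The pairs sort by sort key first; the same sort key with and without an
# actual field yields two distinct pairs, hence two separate entries.
pairs = sorted(set(split_actual(field) for field in raw))
printed = [actual for _, actual in pairs]
```

The list printed then holds the plain alpha, the italic alpha, and beta, in that order, without the comparison routine knowing anything about \it, \/, or braces.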

Page Encapsulation. Page numbers can be encapsulated using the `|' operator. In the example, page 14 on which \index{beta} occurs is set in boldface, as represented by the command \bold. The ability to set page numbers in different fonts allows the index to convey more information about whatever is being indexed. For instance, the place where a definition occurs can be set in one font, its primary example in a second, and others in a third.

Cross Referencing. Some index entries make references to others. In our example the alphabeta entry is a reference to beta, as indicated by the `see' phrase. The page number, however, disappears after formatting (Step IV); hence it is immaterial where index commands dealing with cross references like `see' occur in the document. This is a special case of page encapsulation (see{beta} appears after the `|' operator). Variations like `see also', which gives page numbers as well as references to other entries, work similarly.
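Page encapsulation and its cross-referencing special case can be sketched with one pair of helpers. The raw entry strings "beta|bold" and "alphabeta|see{beta}" are assumed forms of the example entries, and the Python function see is a hypothetical stand-in for a formatter macro that receives the target and the page number and discards the latter.

```python
def split_encap(entry, encap_op="|"):
    """Split a raw entry into (key, encapsulator); encapsulator is None if absent."""
    key, sep, encap = entry.partition(encap_op)
    return key, (encap if sep else None)


def format_page(page, encap):
    """Wrap the page number in the encapsulating command, if there is one."""
    if encap is None:
        return page
    # The page number becomes the final argument of the command.
    return "\\" + encap + "{" + page + "}"


def see(target, page):
    """Mimic a cross-reference macro: the page argument is accepted but dropped."""
    return "see " + target


key, encap = split_encap("beta|bold")
bold_page = format_page("14", encap)          # page set via \bold
cross_ref = format_page("14", "see{beta}")    # page becomes a second argument
```

Because the page number is passed as the last argument, an encapsulator such as see{beta} yields a command with two arguments, and the formatter is free to ignore the second. This is why the location of the cross-reference command in the document does not matter.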

Input/Output Style. In order to be formatter- and format-independent, the index processor must be able to handle a variety of formats. There are two reasons for considering this independence issue on the input side: raw index files generated by systems other than LaTeX may not comply with the default format, and the basic framework established for processing indexes can also be used to process other objects of a similar nature (e.g., glossaries), which will have a different keyword (e.g., \glossaryentry as opposed to \indexentry) at the very least. Similarly, on the output side the index style may vary across systems. Even within the same formatting system, the index may have to look different under different publishing requirements. In other words, there must be a way to inform the processor of the input format and the output style.
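One way to picture this parameterization is a reader whose keyword is an argument and a writer driven by a small style table. The sketch assumes raw lines of the form keyword{key}{page}; the names parse_raw, render, and DEFAULT_STYLE, as well as the style fields item and delim, are hypothetical.

```python
import re

# Hypothetical output style: the command that opens an item and the
# delimiter placed between the key and its page list.
DEFAULT_STYLE = {"item": "\\item ", "delim": ", "}


def parse_raw(lines, keyword="\\indexentry"):
    """Extract (key, page) pairs from raw lines that start with the keyword."""
    # The key may contain nested braces, so it is matched greedily; the page
    # argument is assumed to contain no braces.
    pattern = re.compile(re.escape(keyword) + r"\{(.*)\}\{([^{}]*)\}$")
    entries = []
    for line in lines:
        match = pattern.match(line.strip())
        if match:
            entries.append((match.group(1), match.group(2)))
    return entries


def render(key, pages, style=DEFAULT_STYLE):
    """Format one index line according to the given output style."""
    return style["item"] + key + style["delim"] + pages
```

Processing a glossary then amounts to calling parse_raw(lines, keyword="\\glossaryentry"), and a different publishing requirement is met by passing another style table to render, with the sorting and merging logic left untouched.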